This document is for a workshop to provide some tools, tips, packages etc. that make data processing and visualization in R easier. It is oriented toward those who have had some exposure to R in an applied fashion, but would also be useful to someone coming to R from another programming language. It is not an introduction to R, nor does this document have anything to do with statistical analysis. The goal here is primarily to instill awareness, specifically of tools that will make your data exploration easier.
Some key packages used in the following demonstrations and exercises:
dplyr, tidyr, ggplot2, data.table, plotly, ggplot2movies
Many others are also used; feel free to install them as we come across them. Here are a few:
DT, highcharter, magrittr, maps, mgcv, plotly, quantmod, readr, visNetwork
R has several core data structures:
Vectors form the basis of R data structures.
There are two main types, atomic vectors and lists, but I will treat lists separately.
Here is an R vector. The elements of the vector are numeric values.
x = c(1, 3, 2, 5, 4)
x
[1] 1 3 2 5 4
All elements of an atomic vector are the same type. Examples include:
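Atomic types can be checked with typeof(); a quick sketch of the usual ones (logical, integer, double, character), including the silent coercion that happens when types are mixed:

```r
# each atomic vector has exactly one underlying type
typeof(c(TRUE, FALSE))   # "logical"
typeof(1:3)              # "integer"
typeof(c(1, 3, 2))       # "double"
typeof(c('a', 'b'))      # "character"

# mixing types silently coerces to the most general one
typeof(c(1, 'a'))        # "character"
```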
An important type of vector is the factor. Factors are used to represent categorical data.
x = factor(1:3, labels=c('q', 'V', 'what the heck?'))
x
[1] q              V              what the heck?
Levels: q V what the heck?
The underlying representation is numeric, but it is important to remember that factors are categorical.
They can’t be used as numbers would be, as the following demonstrates.
as.numeric(x)
[1] 1 2 3
sum(x)
Error in Summary.factor(structure(1:3, .Label = c("q", "V", "what the heck?"))): 'sum' not meaningful for factors
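A related gotcha: since the underlying representation is the integer codes, as.numeric on a factor returns the codes, not the labels. If the labels are themselves numbers, convert to character first. A small sketch:

```r
f = factor(c(10, 20, 20, 30))

as.numeric(f)                 # the integer codes: 1 2 2 3
as.numeric(as.character(f))   # the actual values: 10 20 20 30
```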
With multiple dimensions, we are dealing with arrays.
Matrices are 2-d arrays, and extremely commonly used for scientific computing.
The vectors making up a matrix must all be of the same type.
Creating a matrix can be done in a variety of ways.
# create vectors
x = 1:4
y = 5:8
z = 9:12
rbind(x, y, z) # row bind
cbind(x, y, z) # column bind
matrix(c(x, y, z), nrow=3, ncol=4, byrow=TRUE)

Lists in R are highly flexible objects, and probably the most commonly used for data science.
They can contain anything as their elements, even other lists.
Here is a list. We use the list function to create it.
x = list(1, "apple", list(3, "cat"))
x
[[1]]
[1] 1
[[2]]
[1] "apple"
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] "cat"
We often want to loop some function over a list.
for(elem in x) print(class(elem))
[1] "numeric"
[1] "character"
[1] "list"
Lists can, and often do, have named elements.
x = list("a" = 25, "b" = -1, "c" = 0)
x["b"]
$b
[1] -1
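Note the distinction between the bracket forms here: single brackets return a (still named) list of length one, while double brackets or $ return the element itself:

```r
x = list(a = 25, b = -1, c = 0)

x['b']     # a list of length 1, named 'b'
x[['b']]   # the element itself: -1
x$b        # equivalent to x[['b']]
```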
data.frames are a very commonly used data structure. Their columns do not all have to be the same type, because the data.frame class is actually just a list (of equal-length vectors).
But they can also be indexed by row or column as well.
Many packages also provide classes that are both a data.frame and some other type of structure of their own.
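The "a data.frame is just a list" point is easy to verify, and explains why list-style extraction works on data.frames:

```r
d = data.frame(a = c(1, 5, 2), b = c('x', 'y', 'z'))

is.list(d)     # TRUE: a data.frame is a list of equal-length columns
length(d)      # 2, the number of columns
d[['a']]       # list-style extraction of a column
```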
mydf = data.frame(a = c(1,5,2),
b = c(3,8,1))

We can add row names also.
rownames(mydf) = paste0('row', 1:3)
mydf
     a b
row1 1 3
row2 5 8
row3 2 1
Create an object that is a matrix and/or a data.frame, and inspect its class or structure (use the class or str functions).
Create a list of 3 elements, the first of which contains character strings, the second numbers, and the third, the data.frame or matrix you just created.
How is a factor different from a character vector?
How is a data.frame the same as and different from a matrix?
How is a data.frame the same as and different from a list?
Standard methods of reading in data include the following functions:
Base R also comes with the foreign package for reading in files from other statistical packages. However, while you may still see it in use, it's not as useful as what's found in the following packages.
haven: Package to read in foreign statistical files
readxl: for excel files
rio: uses haven, readxl etc. but with just two functions for everything
readr: Faster versions of base R functions
These functions guess column types after an initial scan of the data; if you don't have 'big' data, the speed gain won't matter much.
However, the column specifications they report can actually be used as a diagnostic.
data.table provides a faster version of read.table (fread), and is typically faster than the readr approaches.
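A minimal sketch of fread on a temporary file, assuming the data.table package is installed; note that it returns a data.table, which is also a data.frame:

```r
library(data.table)

# write a small csv to a temporary file for demonstration
tmp = tempfile(fileext = '.csv')
write.csv(data.frame(id = 1:3, value = c(2.5, 1.0, 3.3)), tmp, row.names = FALSE)

d_base = read.csv(tmp)   # base R: a data.frame
d_dt   = fread(tmp)      # data.table: typically much faster on large files

class(d_dt)              # "data.table" "data.frame"
```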
Note that R can handle many types of data.
Some examples:
And many, many others.
feather: designed to make reading/writing data frames efficient
Works in both Python and R. Still in early stages of development.
Why might you use read_csv from the readr package rather than read.csv in base R?
What is your definition of ‘big’ data?
letters[4:6]
[1] "d" "e" "f"
letters[c(13,10,3)]
[1] "m" "j" "c"
myMatrix[1, 2:3]
mydf['row1', 'b']
mydf[1, 2]
mydf['row1', 2]
If the row/column value is empty, all rows/columns are retained.
mydf['row1',]
mydf[,'b']
mydf[c(1,3),]
mydf[mydf$a >= 2,]
[ : grab a slice of elements/columns
[[ : grab specific elements/columns
$ : grab specific elements/columns
@ : extract slot for S4 objects
my_list_or_df[2:4]
my_list_or_df[['name']]
my_list_or_df$name
The following is a refresher of base R indexing only.
Here is a matrix, a data.frame and a list.
mymatrix = matrix(rnorm(100), 10, 10)
mydf = cars
mylist = list(mymatrix, thisdf = mydf)

For the matrix, in separate operations, take a slice of rows, a selection of columns, and a single element.
For the data.frame, grab a column in 3 different ways.
For the list grab an element by number and by name.
Logicals are objects with values of TRUE or FALSE.
Assume x is a vector of numbers.
x = c(-1, 2, 10, -5)
idx = x > 2
idx
[1] FALSE FALSE  TRUE FALSE
x[idx]
[1] 10
We don’t have to create an explicit index object first, as R indexing is ridiculously flexible.
x[x > 2]
x[x != 3]
x[ifelse(x > 2 & x !=10, T, F)]
x[{y = idx; y}]
x[resid(lm(y ~ x)) > 0]

Consider the following unfortunately coded loop:
for (i in 1:nrow(mydf)) {
check = mydf$x[i] > 2
if (check==TRUE) {
mydf$y[i] = 'Yes'
}
else {
mydf$y[i] = 'No'
}
}

Compare:
mydf$y = 'No'
mydf$y[mydf$x > 2] = 'Yes'

This gets us the same thing, and would be much faster. Boolean indexing is an example of a vectorized operation.
The whole vector is considered:
This is always faster.
Example: Log all values in a matrix.
mymatrix_log = log(mymatrix)

Way faster than looping over elements, rows or columns. Here we'll let the apply function stand in for our loop, logging the elements of each column.
mymatrix = matrix(runif(100), 10, 10)
identical(apply(mymatrix, 2, log), log(mymatrix))
[1] TRUE
library(microbenchmark)
microbenchmark(apply(mymatrix, 2, log),
log(mymatrix))
Unit: microseconds
expr min lq mean median uq max neval
apply(mymatrix, 2, log) 41.726 43.527 49.22175 44.728 51.932 186.116 100
log(mymatrix) 2.701 3.002 3.35024 3.002 3.302 11.708 100
Many vectorized functions already exist in R.
They are often written in C, Fortran etc., and so even faster.
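For example, column and row sums have dedicated C-level implementations that beat even a tidy apply call:

```r
m = matrix(1:6, nrow = 2)   # columns: (1,2), (3,4), (5,6)

colSums(m)                              # 3 7 11
all(colSums(m) == apply(m, 2, sum))     # same values, dedicated C code
```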
A family of functions allows for a succinct way of looping.
Common ones include:
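A quick sketch of the usual members: apply for matrix margins, lapply/sapply for lists and vectors, and tapply for grouped summaries:

```r
m = matrix(1:4, 2, 2)
apply(m, 1, sum)                              # row sums: 4 6

lapply(1:3, function(i) i^2)                  # always returns a list
sapply(1:3, function(i) i^2)                  # simplifies to a vector: 1 4 9

tapply(c(1, 2, 10), c('a', 'a', 'b'), mean)   # means by group: a = 1.5, b = 10
```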
Standardizing variables.
for (i in 1:ncol(mydf)){
x = mydf[,i]
for (j in 1:length(x)){
x[j] = (x[j] - mean(x))/sd(x)
}
}

The above would be a really bad way to use R. The following is much cleaner and now you'd have a function you can use elsewhere.
apply will take a matrix, and apply a function over the margin, row or column, you want.
stdize <- function(x) {
(x-mean(x))/sd(x)
}
apply(mydf, 2, stdize) # 1 for rows, 2 for columnwise application

The previous demonstrates how to use apply.
However, there is a scale function in base R that uses a more vectorized approach under the hood.
The following demonstrates various approaches to standardizing the columns of the matrix, even using a parallelized approach.
The base R function requires very little code and beats the others.
mymat = matrix(rnorm(100000), ncol=1000)
stdize <- function(x) {
(x-mean(x)) / sd(x)
}
doubleloop = function() {
for (i in 1:ncol(mymat)) {
x = mymat[, i]
for (j in 1:length(x)) {
x[j] = (x[j] - mean(x)) / sd(x)
}
}
}
singleloop = function() {
for (i in 1:ncol(mymat)) {
x = mymat[, i]
x = (x - mean(x)) / sd(x)
}
}
library(parallel)
cl = makeCluster(8)
clusterExport(cl, c('stdize', 'mymat'))
doParallel::registerDoParallel(cl)
test = microbenchmark::microbenchmark(doubleloop=doubleloop(),
singleloop=singleloop(),
apply=apply(mymat, 2, stdize),
parApply=parApply(cl, mymat, 2, stdize),
vectorized=scale(mymat), times=25)
stopCluster(cl)
test
Unit: milliseconds
expr min lq mean median uq max neval
doubleloop 2882.148089 2904.739076 2938.21281 2946.57065 2952.25532 3013.29805 25
singleloop 28.989806 29.512434 33.08177 31.17037 32.42606 84.79185 25
apply 32.167901 33.350945 33.96146 34.13204 34.64055 36.18983 25
parApply 17.691611 18.213638 21.91323 19.63893 21.12156 73.60683 25
vectorized 8.085578 8.491433 10.25583 10.66990 11.02532 12.62173 25
Benefits
They are NOT necessarily faster than explicit loops.
But they can ALWAYS potentially be faster than loops.
I use R every day, and rarely use explicit loops.
I pretty much never use an explicit double loop, as a little more thinking about the problem will usually provide a more efficient path to solving the problem.
Apply functions and similar approaches should be a part of your regular R experience.
Other versions we’ll talk about have been optimized, but you need to know the basics in order to use those.
And you still may need parallel versions.
With the following matrix, use apply and the sum function to get row or column sums.
x = matrix(1:9, 3, 3)

With the following list, use lapply and sapply and the sum function to get sums for the elements. There is no margin to specify with a list, so just supply the list and the sum function.
x = list(1:3, 4:6, 7:9)

sapply is actually just a wrapper for lapply. If you supply the argument simplify=FALSE, it is identical. Otherwise, it attempts to simplify the result to a vector or matrix.
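The equivalence is easy to demonstrate (using a throwaway list rather than the exercise data):

```r
z = list(1:2, 3:5, 6)

lapply(z, sum)                     # a list of sums
sapply(z, sum)                     # the same sums, simplified to a vector

identical(sapply(z, sum, simplify = FALSE), lapply(z, sum))  # TRUE
```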
Note: More detail on much of the following is given in another workshop.
Pipes are operators that send what comes before to what comes after.
There are many different pipes, and some packages that use their own.
However, the vast majority of packages use the same pipe: %>%, from the magrittr package.
Here, we’ll focus on their use with the dplyr package. Later, we’ll use it for visualizations.
Example:
mydf %>%
select(var1, var2) %>%
filter(var1 == 'Yes') %>%
summary

Start with a data.frame %>%
   select columns from it %>%
   filter/subset it %>%
   get a summary
One nice thing about pipelines is that we can use variables as soon as they are created, without having to break out separate objects/steps.
mydf %>%
mutate(newvar1 = var1 + var2,
newvar2 = newvar1/var3) %>%
summarise(newvar2avg = mean(newvar2))

The following provides a means to think about pipes for visualization. It's just a generic example for now, but we'll see more later.
basegraph %>%
points %>%
lines %>%
layout

Most functions are not 'pipe-aware' by default.
Example: pipe to a modeling function.
mydf %>%
lm(y ~ x) # error

Other pipes could work in this situation.
But generally, when you come upon this situation, you can use a dot.
mydf %>%
lm(y ~ x, data=.) # . == mydf

Piping is not just for data.frames.
c('Ceci', "n'est", 'pas', 'une', 'pipe!') %>%
{
  .. <- . %>%
    if (length(.) == 1) .
    else paste(.[1], '%>%', ..(.[-1]))
  ..(.)
}
[1] "Ceci %>% n'est %>% pas %>% une %>% pipe!"
Put that in your pipe and smoke it, René Magritte!
Pipes are best used interactively.
Extremely useful for data exploration.
Common in many visualization packages.
See the magrittr package for more pipes.
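For the curious, a hedged sketch of two of magrittr's specialty pipes: the exposition pipe %$%, which exposes column names directly, and the tee pipe %T>%, which returns its left-hand side (useful for side effects like printing or plotting mid-pipeline):

```r
library(magrittr)

# exposition pipe: refer to columns of mtcars without mtcars$
r = mtcars %$% cor(mpg, wt)

# tee pipe: print the dimensions as a side effect, then keep piping the data
n = mtcars %T>% {print(dim(.))} %>% nrow()
```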
The tidyverse consists of a few key packages-
And of course the tidyverse package which will load all of the above.
See also: lubridate, rvest, stringr and others in the ‘hadleyverse’.
Tidy data refers to data arranged in a way that makes data processing, analysis, and visualization simpler.
In a tidy data set:
Think long before wide.
dplyr provides a grammar of data manipulation (like ggplot2 does for visualization).
It is the next iteration of plyr, but there is no longer any need for plyr really.
It’s focused on tools for working with data frames.
It has three main goals:
Some key operations:
Various join/merge functions:
Little things like:
No need to quote variable names.
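The join/merge functions mentioned above (left_join, inner_join, anti_join, and friends) can be sketched with two toy data frames; this assumes dplyr is loaded:

```r
library(dplyr)

band = data.frame(name = c('Mick', 'John', 'Paul'),
                  band = c('Stones', 'Beatles', 'Beatles'),
                  stringsAsFactors = FALSE)
inst = data.frame(name = c('John', 'Paul', 'Keith'),
                  plays = c('guitar', 'bass', 'guitar'),
                  stringsAsFactors = FALSE)

left_join(band, inst, by = 'name')    # all rows of band, NA where no match
inner_join(band, inst, by = 'name')   # only names present in both
anti_join(band, inst, by = 'name')    # rows of band with no match in inst
```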
Let’s say we want to select from our data the following variables:
How might we go about this?
Tedious, or multiple objects just to get the columns you want.
# numeric indexes; not conducive to readability or reproducibility
newData = oldData[,c(1,2,3,4, etc.)]
# explicitly by name; fine if only a handful; not pretty
newData = oldData[,c('ID','X1', 'X2', etc.)]
# two step with grep; regex difficult to read/understand
cols = c('ID', paste0('X', 1:10), 'var1', 'var2', grep(colnames(oldData), '^XYZ', value=T))
newData = oldData[,cols]
# or via subset
newData = subset(oldData, select = cols)

What if you also want observations where Z is Yes, Q is No, and only the observations with the top 50 values of var2, ordered by var1 (descending)?
# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No',]
newData = newData[order(newData$var2, decreasing=T)[1:50],]
newData = newData[order(newData$var1, decreasing=T),]

And this is for fairly straightforward operations.
newData = oldData %>%
filter(Z == 'Yes', Q == 'No') %>%
select(num_range('X', 1:10), contains('var'), starts_with('XYZ')) %>%
top_n(n=50, var2) %>%
arrange(desc(var1))

dplyr and piping provide an alternative.
Even though the initial base R approach depicted is fairly concise, it still can potentially be:
Two primary functions for manipulating data
Other useful functions include:
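As a sketch of some of those other functions: spread is the inverse of gather (wide from long), and separate splits one column into several:

```r
library(tidyr)

long = data.frame(time  = rep(1:2, each = 2),
                  stock = rep(c('X', 'Y'), 2),
                  price = c(1.1, 2.2, 1.3, 2.1),
                  stringsAsFactors = FALSE)

wide = long %>% spread(stock, price)   # one column per stock again

dates = data.frame(d = c('2009-01', '2009-02'), stringsAsFactors = FALSE)
split_dates = dates %>% separate(d, into = c('year', 'month'), sep = '-')
```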
library(tidyr)
stocks <- data.frame( time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4) )
stocks %>% head
        time          X           Y          Z
1 2009-01-01 -0.6162083 -3.64834112 -8.1590601
2 2009-01-02 -0.6380109 -3.45814122 -1.8638771
3 2009-01-03 -1.4380080 -0.09853711 2.3541401
4 2009-01-04 1.5803772 -1.28003978 3.5116183
5 2009-01-05 1.2278476 -0.12016253 -1.7042857
6 2009-01-06 0.4709159 -0.91009052 0.2849541
stocks %>% gather(stock, price, -time) %>% head
        time stock      price
1 2009-01-01 X -0.6162083
2 2009-01-02 X -0.6380109
3 2009-01-03 X -1.4380080
4 2009-01-04 X 1.5803772
5 2009-01-05 X 1.2278476
6 2009-01-06 X 0.4709159
Note that the latter is an example of tidy data while the former is not.
The dplyr grammar is clear for a lot of standard data processing tasks, and some not so common.
Extremely useful for data exploration and visualization.
Drawbacks:
multidplyr
Install and load the dplyr and ggplot2movies packages. Look at the help file for the movies data set, which contains data from IMDB.
install.packages('ggplot2movies')
library(ggplot2movies)

Using the movies data set, perform each of the following actions separately.
Use mutate to create a centered version of the rating variable. A centered variable is one whose mean has been subtracted from it. The process will take the following form:
data %>%
mutate(newvar = '?')

Use filter to create a new data frame that has only movies from the years 2000 and beyond. Use the greater than or equal operator >=.
Use select to create a new data frame that only has the title, year, budget, length, rating and votes variables. There are at least 3 ways to do this.
Use group_by to group the data by year, and summarize to create a new variable that is the average budget. The summarize function works just like mutate in this case. Use the mean function to get the average, but you’ll also need to use the argument na.rm = TRUE within it because the earliest years have no budget recorded.
Now put it all together in one set of piped operations.
the mpaa, Action, and Drama variables
mpaa and Action or Drama

data.table works in a notably different way than dplyr. However, you'd use it for the same reasons.
Like dplyr, the data objects are both data.frames and a package specific class.
Faster subset, grouping, update, ordered joins etc.
In general, data.table works with brackets as in base R.
However, the brackets work like a function call!
x[i, j, by, keyby, with = TRUE, ...]

Importantly: you don't use the brackets as you would with data.frames.
library(data.table)
dt = data.table(x=sample(1:10, 6), g=1:3, y=runif(6))
class(dt)
[1] "data.table" "data.frame"
What i and j can be are fairly complex.
In general, you use i for filtering by rows.
dt[2]
   x g         y
1: 5 2 0.6744818
dt[2,]
   x g         y
1: 5 2 0.6744818
In general, you use j to select (by name!) or create new columns.
dt[,x]
dt[,z := x+y] # dt now has a new column
dt[,z]
dt[g>1, mean(z), by=g]
dt
[1] 1 5 2 9 8 7
x g y z
1: 1 1 0.47678295 1.476783
2: 5 2 0.67448180 5.674482
3: 2 3 0.74166616 2.741666
4: 9 1 0.07891256 9.078913
5: 8 2 0.28856876 8.288569
6: 7 3 0.12940937 7.129409
[1] 1.476783 5.674482 2.741666 9.078913 8.288569 7.129409
g V1
1: 2 6.981525
2: 3 4.935538
x g y z
1: 1 1 0.47678295 1.476783
2: 5 2 0.67448180 5.674482
3: 2 3 0.74166616 2.741666
4: 9 1 0.07891256 9.078913
5: 8 2 0.28856876 8.288569
6: 7 3 0.12940937 7.129409
Dropping columns is awkward.
dt[,-y] # creates negative values of y
dt[,-'y', with=F] # drops y, but now needs quotes
## dt[,y:=NULL] # drops y, but this is just a base R approach
## dt$y = NULL
[1] -0.47678295 -0.67448180 -0.74166616 -0.07891256 -0.28856876 -0.12940937
x g z
1: 1 1 1.476783
2: 5 2 5.674482
3: 2 3 2.741666
4: 9 1 9.078913
5: 8 2 8.288569
6: 7 3 7.129409
We can now attempt a ‘group-by’ operation, along with creation of a new variable.
Note that these operations actually modify dt in place, a key distinction with dplyr.
dt1 = dt2 = dt
dt[,sum(x,y), by=g] # sum of all x and y values
   g        V1
1: 1 10.555696
2: 2 13.963051
3: 3 9.871076
dt1[,newvar := sum(x), by=g] # add new variable to the original data
   x g          y        z newvar
1: 1 1 0.47678295 1.476783 10
2: 5 2 0.67448180 5.674482 13
3: 2 3 0.74166616 2.741666 9
4: 9 1 0.07891256 9.078913 10
5: 8 2 0.28856876 8.288569 13
6: 7 3 0.12940937 7.129409 9
We can also create groupings on the fly.
For a new summary data set, we’ll take the following approach.
dt2[, list(meanx = mean(x), sumx = sum(x)), by=g==1]
       g meanx sumx
1: TRUE 5.0 10
2: FALSE 5.5 22
dt1[dt2]

The following demonstrates some timings from a benchmark comparison of grouped operations.
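For dt1[dt2] to work as a join, the data.tables need a key (set via setkey), or you can supply an on= argument directly; a hedged sketch with toy data:

```r
library(data.table)

d1 = data.table(g = c(1, 2, 3), x = c(10, 20, 30))
d2 = data.table(g = c(2, 3), label = c('b', 'c'))

setkey(d1, g)              # key d1 on the grouping column
joined  = d1[d2]           # join: rows of d1 matching d2's g values
joined2 = d1[d2, on = 'g'] # equivalent, no key required
```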
By the way, never, ever use aggregate. For anything.
fun elapsed
1: aggregate 114.35
2: by 24.51
3: sapply 11.62
4: tapply 11.33
5: dplyr 10.97
6: lapply 10.65
7: data.table 2.71
Ever.
Really.
Can be done but awkward at best.
mydt[,newvar:=mean(x),][,newvar2:=sum(newvar), by=group][,-'y', with=FALSE]
mydt[,newvar:=mean(x),
][,newvar2:=sum(newvar), by=group
][,-'y', with=FALSE
]Probably better to just use a pipe and dot approach.
mydt[,newvar:=mean(x),] %>%
.[,newvar2:=sum(newvar), by=group] %>%
.[,-'y', with=FALSE]

Faster methods are great to have.
Drawbacks:
If speed and/or memory is (potentially) a concern, data.table
For interactive exploration, dplyr
Piping allows one to use both, so no need to choose.
And on the horizon…
Coming soon to an R near you.
This implements the data table back-end for ‘dplyr’ so that you can seamlessly use data table and ‘dplyr’ together.
Or play with now. The following shows times for a grouped operation of a data frame of two variables, a random uniform draw of 5e7 values, and a grouping variable of 500k groups.
| package | timing |
|---|---|
| dplyr | 10.97 |
| data.table | 2.71 |
| dtplyr | 2.70 |
Just for giggles I did the same in python with a pandas DataFrame and groupby operation, and it took 7+ seconds.
Install and load the data.table package.
Create the following data table.
mydt = data.table(expand.grid(x=1:3, y=c('a','b','c')), z=sample(1:20, 9))
Create a new object that contains only the ‘a’ group. Think back to how you use a logical to select rows.
Create a new object that is the sum of z grouped by x. You don’t need to name the sum variable.
ggplot2 is an extremely popular package for visualization in R.
It entails a grammar of graphics.
Key ideas:
Strengths:
Aesthetics map data to aesthetic aspects of the plot.
The function used in ggplot to do this is aes
aes(x=myvar, y=myvar2, color=myvar3, group=g)

In general, we start with a base layer and add to it.
In most cases you’ll start as follows.
ggplot(aes(x=myvar, y=myvar2), data=mydata)

This would just produce a plot background.
Layers are added via piping.
The first layers added are typically geoms:
ggplot2 was using pipes before it was cool, and so it has a different pipe.
Otherwise, the concept is the same as before.
ggplot(aes(x=myvar, y=myvar2), data=mydata) +
geom_point()

And now we would have a scatterplot.
library(ggplot2)
data("diamonds"); data('economics')
ggplot(aes(x=carat, y=price), data=diamonds) +
geom_point()

ggplot(aes(x=date, y=unemploy), data=economics) +
geom_line()

In the following, one setting is not mapped to the data.
ggplot(aes(x=carat, y=price), data=diamonds) +
geom_point(aes(size=carat, color=clarity), alpha=.25)

There are many statistical functions built in.
Key strength: you don’t have to do much preprocessing.
Quantile regression lines:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_quantile()

Loess (or additive model) smooth:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth()

Bootstrapped confidence intervals:
ggplot(mtcars, aes(cyl, mpg)) +
geom_point() +
stat_summary(fun.data = "mean_cl_boot", colour = "orange", alpha=.75, size = 1)

Facets allow for paneled display, a very common operation.
In general, we often want comparison plots.
facet_grid will produce a grid.
facet_wrap is more flexible.
Both use a formula approach to specify the grouping.
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
facet_grid(vs ~ cyl, labeller = label_both)

ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
facet_wrap(vs ~ cyl, labeller = label_both, ncol=2)

ggplot2 makes it easy to get good looking graphs quickly.
However the amount of fine control is extensive.
ggplot(aes(x=carat, y=price), data=diamonds) +
geom_point(aes(color=clarity), alpha=.5) +
scale_y_log10(breaks=c(1000,5000,10000)) +
xlim(0, 10) +
scale_color_brewer(type='div') +
facet_wrap(~cut, ncol=3) +
theme_minimal() +
theme(axis.ticks.x=element_line(color='darkred'),
axis.text.x=element_text(angle=-45),
axis.text.y=element_text(size=20),
strip.text=element_text(color='forestgreen'),
strip.background=element_blank(),
panel.grid.minor=element_line(color='lightblue'),
legend.key=element_rect(linetype=4),
legend.position='bottom')

In the last example you saw two uses of a theme.
Each argument takes on a specific value or an element function:
The base theme is not too good.
You will almost invariably need to tweak it.
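The element functions are element_text, element_line, element_rect, and element_blank; a minimal sketch of tweaking the base theme (this only builds the plot object, nothing exotic assumed):

```r
library(ggplot2)

p = ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  theme_minimal() +                                # a cleaner base theme
  theme(axis.title       = element_text(size = 14),
        panel.grid.minor = element_blank(),        # drop an element entirely
        plot.background  = element_rect(fill = 'grey95'))
```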
ggplot2 now has its own extension system, and there is even a website to track the extensions.
Examples include:
ggplot2 is an easy to use, but powerful visualization tool.
Allows one to think in many dimensions for any graph:
2d graphs are only useful for conveying the simplest of ideas.
Use ggplot2 to easily create more interesting visualizations.
Install and load the ggplot2 package.
Create two plots, one a scatterplot (e.g. with geom_point) and one with lines (e.g. geom_line), with a data set of your choosing (all of the following are base R or available after loading ggplot2). Some suggestions:
Recall the basic form for ggplot.
ggplot(aes(x=*, y=*, other), data=*) +
geom_*() +
otherLayers, theme etc.
Themes to play with:
Play around and change the arguments to the following. You’ll need to install the maps package. For example, do points for all county midpoints, with different colors.
library(maps)
mi = map_data("county", "michigan")
seats = mi %>%
group_by(subregion) %>%
summarise_at(vars(lat, long), function(x) median(range(x)))
ggplot(mi, aes(long, lat)) +
geom_polygon(aes(group = group), fill = NA, colour = "grey60") +
geom_text(aes(label = subregion), data = seats, size = 1, angle = 45) +
geom_point(y=42.281389, x=-83.748333, color='#1e90ff', size=3) +
theme_minimal() +
theme(panel.grid=element_blank())

ggplot2 is the most widely used package for visualization in R.
However, it is not interactive by default.
Many packages use htmlwidgets, d3 (JavaScript library) etc. to provide interactive graphics.
General:
Specific functionality:
One of the advantages to piping is that it’s not limited to dplyr style data management functions.
Any R function can be potentially piped to.
This facilitates data exploration, especially visually.
Many newer visualization packages take advantage of piping.
htmlwidgets is a package that makes it easy to create JavaScript-based visualizations.
The packages using it typically are pipe-oriented and produce interactive plots.
A couple demonstrations with plotly.
Note the layering as with ggplot2.
Piping used before plotting.
library(plotly)
midwest %>%
filter(inmetro==T) %>%
plot_ly(x=~percbelowpoverty, y=~percollege, mode='markers')

plotly has modes, which allow for points, lines, text and combinations.
Traces work similar to geoms.
library(mgcv); library(modelr)
mtcars %>%
mutate(amFactor = factor(am, labels=c('auto', 'manual')),
hovertext = paste('weight:', wt, '<br>', 'mpg:', mpg, '<br>', amFactor)) %>%
add_predictions(gam(mpg~s(wt, am, bs='fs'), data=mtcars)) %>%
arrange(wt) %>%
plot_ly() %>%
add_markers(x=~wt, y=~mpg, color=~amFactor, text=~hovertext, hoverinfo='text') %>%
add_lines(x=~wt, y=~pred, color=~amFactor, alpha=.5, name='gam prediction', showlegend=F)

The nice thing about plotly is that we can feed a ggplot to it.
It would have been easy to use geom_smooth to get a similar result, so let’s do so.
gp = mtcars %>%
mutate(amFactor = factor(am, labels=c('auto', 'manual')),
hovertext = paste(wt, mpg, amFactor)) %>%
arrange(wt) %>%
ggplot(aes(x=wt, y=mpg, color=amFactor)) +
geom_smooth(se=F) +
geom_point(aes(color=amFactor))
ggplotly(gp)

Highcharter is also fairly useful for a wide variety of plots.
library(highcharter); library("quantmod")
x <- getSymbols("GOOG", auto.assign = FALSE)
highchart(type='stock', width=1000) %>%
hc_add_series(data = x)

visNetwork allows for network visualizations.
library(visNetwork)
visNetwork(nodes, edges) %>% #, height=600, width=800
visNodes(shape='circle',
font=list(),
scaling=list(min=10, max=50, label=list(enable=T))) %>%
visLegend()

Use the DT package for interactive data frames.
library(DT)
ggplot2movies::movies %>%
select(1:6) %>%
filter(rating>8, !is.na(budget), votes > 1000) %>%
datatable()

Shiny is a framework that can essentially allow you to build an interactive website.
Most of the more recently developed visualization packages will work specifically within the shiny and rmarkdown settings.
Interactivity allows for even more dimensions to be brought to a graphic.
Interactive graphics are more fun too!
Just a couple visualization packages can go a very long way.
Install and load the plotly package. Load the dplyr and ggplot2 packages if necessary.
Using dplyr, group by year, and summarize to create a new variable that is the average rating. Then create a plot with plotly for a line or scatter plot (for the latter, use the add_markers function). It will take the following form:
data %>%
group_by() %>%
summarize() %>%
plot_ly() %>%
add_markers()
This time group by year and Drama. In the summarize, create the average rating again, but also a variable representing the average number of votes. In your plotly plot, use the size and color arguments to represent the average number of votes and whether it was Drama or not, respectively. Use add_markers.
Create a ggplot of your design and then use ggplotly to make it interactive.
With the right tools, data exploration can be:
Use them to wring your data dry of what it has to offer.
Recommended next steps: